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Abstract 


This paper explores a surprising equivalence between two seemingly-distinct convex opti¬ 
mization methods. We show that simulated annealing, a well-studied random walk algorithms, 
is directly equivalent, in a certain sense, to the central path interior point algorithm for the the 
entropic universal barrier function. This connection exhibits several benefits. First, we are able 
improve the state of the art time complexity for convex optimization under the membership 
oracle model. We improve the analysis of the randomized algorithm of Kalai and Vempala [7] 
by utilizing tools developed by Nesterov and Nemirovskii m that underly the central path fol¬ 
lowing interior point algorithm. We are able to tighten the temperature schedule for simulated 
annealing which gives an improved running time, reducing by square root of the dimension 
in certain instances. Second, we get an efficient randomized interior point method with an 
efficiently computable universal barrier for any convex set described by a membership oracle. 
Previously, efficiently computable barriers were known only for particular convex sets. 
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1 Introduction 


Convex optimization is by now a well established field and a cornerstone of the helds of algorithms 
and machine learning. Polynomial time methods for convex optimization belong to relatively few 
classes: the oldest and perhaps most general is the ellipsoid method with roots back to Kachiyan in 
the 50s [Ij. Despite its generality and simplicity, the ellipsoid method is known to perform poorly 
in practice. 

A more recent family of algorithms are the celebrated interior point methods, initially developed 
by Karmarkar in the context of linear programming, and generalized in the seminal work of Nesterov 
and Nemirovskii |16) . These methods are known to perform well in practice and come with rigorous 
theoretical guarantees of polynomial running time, but with a signihcant catch: the underlying 
constraints must admit an efhciently-computable self-concordant barrier function. Barrier functions 
are dehned as satisfying certain differential inequality conditions that facilitate the path-following 
scheme developed by Nesterov and Nemirovskii [16], in particular it guarantees that the Newton 
step procedure maintains feasibility of the iterates. Indeed the iterative path following scheme 
essentially reduces the optimization problem to the construction of a barrier function, and in many 
nice scenarios a self-concordant barrier is easy to obtain; e.g., for polytopes the simple logarithmic 
barrier suffices. Yet up to the present there is no known universal efficient construction of a barrier 
that applies to any convex set. The problem is seemingly even more difficult in the membership 
oracle model where our access to JC is given only via queries of the form “is x G /C?”. 

The most recent polynomial time algorithms are random-walk based methods, originally pio¬ 
neered in the work of Dyer et al. |3] and greatly advanced by Lovasz and Vempala m- These 
algorithms apply in fnh generality of convex optimization and require only a membership oracle. 
The state of the art in polynomial time convex optimization is the random-walk based algorithm of 
simulated annealing and the specihc temperature schedule analyzed in the breakthrough of Kalai 
and Vempala [7j. Improvements have been given in certain cases, most notably in the work of 
Narayanan and Rakhlin |I4] where barrier functions were utilized. 

In this paper we tie together two of the three known methodologies for convex optimization, give 
an efficiently computable universal barrier for interior point methods, and derive a faster algorithm 
for convex optimization in the membership oracle model. Specihcally, 

1. We dehne the heat path of a simulated annealing method as the (determinisitc) curve formed by 
the mean of the annealing distribution as the temperature parameter is continuously decreased. 
We show that the heat path coincides with the central path of an interior point algorithm with 
the entropic universal barrier function. This intimately ties the two major convex optimization 
methods together and shows they are approximately equivalent over any convex set. 

We further enhance this connection by showing that the central path following interior point 
method applied with the universal entropic barrier is a hrst-order approximation of simulated 
annealing. 

2. Using the connection above, we give an efficient randomized interior point method with an 
efficiently computable universal barrier for any convex set described by a membership oracle. 
Previously, efficiently computable barriers were known only for particular convex sets. 

3. We give a new temperature schedule for simulated annealing inspired by interior point methods. 
This gives rise to an algorithm for general convex optimization with running time of O(yYn^), 
where v is the self-concordance parameter of the entropic barrier. The previous state of the 
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art is 0{n^-^) by |7]. We note that our algorithm does not need explicit access to the entropic 
barrier, it only appears in the analysis of the temperature schedule. 


It was recently shown by Bubeck and Eldan [2] that the entropic barrier satisfies all require 
self-concordance properties and that the associated barrier parameter satisfies < n(l -|- o(l)), 
although this is generally not tight. Our algorithm improves the previous annealing run time by 
a factor of which in many cases is o(l). For example, in the case of semi-definite pro¬ 
gramming over matrices in the entropic barrier is identically the standard log-determinant 

barrier [5], exhibiting a parameter v = n, rather than n^, which an improvement of n compared 


to the state-of-the-art. More details are given in section 4.4 


The Problem of Convex Optimization For the remainder of the paper, we will be consider¬ 
ing the following algorithmic optimization problem. Assume we are given access to an arbitrary 
bounded convex set JC C and we shall assume without loss of generality that JC lies in a 2-norm 
ball of radius 1. Assume we are also given as input a vector 9 G W^. Our goal is to solve the 
following: 

TtmiO^ X. (1) 

xe/c 

We emphasize that this is, in a certain sense, the most general convex optimization problem one 
can pose. While the objective is linear in x, we can always reduce non-linear convex objectives 
to the problem 0. If we want to solve minj;gyc f{x) for some convex / : /C —)• M, we can instead 
define a new problem as follows. Letting 1C' := {{x,c) G /C x M : f{x) — c < 0}, this non-linear 
problem is equivalent to solving the linear problem minK^, c- This equivalence is true in in 
the membership oracle model, since a membership oracle for K, immediately implies an efficient 
membership oracle for K,'. 


1.1 Preliminaries 

This paper ties together notions from probability theory and convex analysis, most definitions are 
deferred to where they are first used. We try to follow the conventions of interior point literature 
as in the excellent text of Nemirovski |15j . and the simulated annealing and random-walk notation 

of [7|. 

For some constant (7, we say a distribution P is C-isotropic if for any vector v G we have 

h\vf< E [(r;Tx)2] < C||uf. 

C 


Let P,P' be two distributions on M"' with means /r,respectively. We say 
respect to P' if 


- E [{v^Xf] < E < C E [(^^"^71)2]. 

C x~p x~P' 


P is C'-isotropic with 


One measure of the distance between two distributions, often referred to as 
by 


vr 


E 





the £2 norm, is given 


We note that this distance is not symmetric in general. 
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For a differentiable convex function / : M” —>■ M, the Bregman divergence Df(x,y) between 
points x,y £ dom(/) is the quantity 

Dfix, y) = fix) - f{y) - V f{y)'^ [x - y). 

Further, we can always define the Fenchel conjugate (or Fenchel dual) [T7] which is the function 
f* defined as 

f*{9) := sup,,gdom(/) - fix). (2) 

It is easy to see that /*(•) is also convex, and under weak conditions one has f** = /. A classic 
duality result (see e.g. m) states that when f* is smooth and strictly convex on its domain and 
tends to infinity at the boundary, we have a characterization of the gradients of / and f* in terms 
of maximizers; 

V/*(0) = argmax0'''x — fix) and V/(x) = argmax 6~^ x — f*iO). (3) 

3;Sdom{/) 6Sdom(/*) 


1.2 Structure of this paper 

We start by an overview of random-walk methods for optimization in the next section, and introduce 
the notion of the heat path for simulated annealing. The following section surveys the important 
notions from interior point methods for optimization and the entropic barrier function. In section 
l^we tie the two approaches together formally by proving that the heat path and central path are 
the same for the entropic barrier. We proceed to give a new temperature schedule for simulated 
annealing as well as prove its convergence properties. In the appendix we describe the Kalai- 
Vempala methodology for analyzing simulated annealing and its main components for completeness. 


2 An Overview of Simulated Annealing 


Consider the following distribution over the set JC for an arbitrary input vector 6 £ M”'. 


Peix) 


exp(— 

exp(— dx' * 


(4) 


This is often referred to as the Boltzmann distribution and is a natural exponential family parame¬ 
terized by 0. It was observed by [7] that the optimization objective 0 can be reduced to sampling 
from these distributions. That is, if we choose some scaling quantity t > 0, usually referred to as 
the temperature, then any sample X from the distribution must be nt-optimal in expectation. 
More precisely, [7] show that 

E X] — X < nt. (5) 

x&K 

As we show later, our connection implies an even stronger statement, replacing n above by the 
self-concordant parameter of the entropic barrier, as we will dehne in the next section equation 

(T§. 

It is quite natural that for small temperature parameter t G M, samples from the Pa are 

t 

essentially optimal - the exponential nature of the distribution will eventually concentrate all 
probability mass on a small neightborhood around the maximizing point x* £ 1C. The problem, of 
course, is that sampling from a point mass around x* is precisely as hard as finding x*. 
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This brings us to the technique of so-called simulated annealing, originally introduced by Kirk¬ 
patrick et al. [9] for solving generic problems of the form miUx^ic f (x), for arbitrary (potentially 
non-convex) functions /. At a very high level, simulated annealing would begin by sampling from 
a “high-entropy” distribution (t very close to 0), and then continue by slowly “turning down the 
temperature” on the distribution, i.e. decreasing t, which involves sampling according to the pdf 
Qf^t{x) oc exp(—i/(x)). The intuition behind annealing is that, as long as t'/t is a small constant, 
then the distributions Qf^t' and Qj^t will be “close” in the sense that a random walk starting from 
the initial distribution Qf^t' will mix quickly towards its stationary distribution Qf^t- 

Since its inception, simulated annealing is generally referred to as a heuristic for optimization, 
as polynomial-time guarantees have been difficult to establish. However, the seminal work of Kalai 
and Vempala [7] exhibited a poly-time annealing method with formal guarantees for solving linear 
optimization problems in the form of 0 . Their technique possessed a particularly nice feature: 
the sampling algorithm utilizes a well-studied random walk (Markov chain) known as HitAndRun 
|18l 1101 [T3] . and the execution of this Markov chain requires only access to a membership oracle 
on the set 1C. That is, HitAndRun does not rely on a formal description of JC but only the ability 
to (quickly) answer queries “x G /C?” for arbitrary x G 

Let us now describe the HitAndRun algorithm in detail. We note that this Markov chain 
was first introduced by Smith [18], a poly-time guarantee was given by Lovasz [TO] for uniform 
sampling, and this was generalized to arbitrary log-concave distributions by Lovasz and Vempala 
m- HitAndRun requires several inputs, including: (a) the distribution parameter 6, (b) an 
estimate of the covariance matrix S of Pq, (c) the membership oracle O/c, for JC, (d) a starting 
point Xq, and (e) the number of iterations N of the random walk. 


Algorithm 1: HitAndRun( 0, Oyc, iV, S, Aq) for approximately sampling 


1 

2 

3 

4 
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Inputs: parameter vector 9, oracle O/c for 1C, covariance matrix S, ^(titerations N, initial 
Ao G A. 

for i = \,2,... ,N do 

Sample a random direction u A(0,S) 

Querying Ojc, determine the line segment R = {Aj_i + pu : p G M} H A 
Sample a point Aj from R according to the distribution Pg restricted to R 


6 Return A^r 


The first key fact of HitAndRun(0) is that the stationary distribution of this Markov chain is 
indeed the desired Pg ; a proof can be found in |19j . The difficulty for this and many other random 
walk techniques is to show that the Markov chain “mixes quickly”, in that the number of steps N 
needn’t be too large as a function of n. This issue has been the subject of much research will be 
discussed below. Before proceeding, we note that a single step of HitAndRun can be executed 
quite efficiently. Sampling a random gaussian vector with covariance S (line can be achieved by 
simply sampling a standard gaussian vector z and returning Computing the line segment R 

(linej^ requires simply finding the two locations where the line {Xi-i + pu : p G M} intersects with 
the boundary of A, but an e-approximation of these points can be found via binary search using 
O (log T) queries to Ojc. Sampling from Pg restricted to the line segment R can also be achieved 
efficiently, and we refer the reader to Vempala m- 

The analysis for simulated annealing in [7] proceeds by imagining a sequence of distributions 
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Figure 1: The progression of two Hit-and-Run random walks for a high temperature (red squares) 
and a low temperature (black circles). Notice that at low temperature the walk coverges very 
quickly to a corner of /C. 


where ti = i? is the diameter of the set K, and ■ Let k = 0{^/n\og ^), 

then sampling from is enough to achieve the desired optimization guarantee. That is, via 
Equation we see a sample from is e-optimal in expectation. 

To sample from Pg^, [7] construct a recursive sampling oracle using HitAndRun. The idea is 
that samples from can be obtained from a warm start by sampling from Pg^ according to a 

carefully chosen temperature schedule. The details are given in Algorithm 


Algorithm 2: SimulatedAnnealing with HitAndRun - Kalai and Vempala [7] 


1 Input: temperature schedule {tk-,k G [T]}, objective 6. 

2 Set Xq = 0, El = I, ti = P 

3 for k = 1, ..., T do 


^ ^ 

Xk ^ HiTANDRuN(6»fc, Ok:, N, Ek,Xk-i) 

for j = 1,.., n do 

L YI = HlTANDRUN(0fc,Oyc,iV,Sfc,y^^_,) 
Estimate covariance: Sfc+i := CovMtx(Y’/',. 


9 Return Xt 


5 n 


The Kalai and Vempala [7] analysis leans on a number of technical but crucial facts which they 
prove. The temperature update schedule that they devise, namely tfc = (1 — ■^)tk-i, is shown to 
satisfy these iterative rules and thus return an approximate solution. 
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Theorem 1 (Key result of Kalai and Vempala [3 and Lovasz and Vempala |11]1. Fix k and 
consider the HitAndRun walk used in Algorithm]^ to compute and for each j. Assume we 
ehoose the temperature schedule in order that suecessive distributions are close in ^ 2 -' 


max 


POk 

P^k-l 2 


Po, 


< 10 . 


( 6 ) 


Then, as long as the warm start samples X^-i and are (approximately) distributed aecording 
to P 9 ^_i, the random walk HitAndRun mixes to Pg^, with N = 0{n^) steps. That is, the output 
samples X^ and Y^ are distributed aecording to Pg^ up to error < e. 

In the appendix we sketch the proof of this theorem for completeness. 

Corollary 1. The temperature schedule t^ ■= {l — l/^/n)^tl satisfies condition Q, and thus 
Algorithm^ with this schedule returns an e-approximate solution in time 

Proof. By equation (© , to achieve e error it suffice that j > ^, or in other words k needs to be large 
enough such that (1“;^)*' < ^ for which/c = Sy^log” suffices: (1 — ;^)^ < e = g-^iogj < 

Hence the temperature schedule need be applied with T = 0{^/n) iterations. Each iteration requires 
0{n) applications of HitAndRun that cost O(n^), for the total running time of □ 

In later sections we proceed to give a more refined temperature schedule that satisfies the Kalai- 
Vempala conditions, and thus results in a faster algorithm. Our temperature schedule is based on 
new observations in interior point methods, which we describe next. 


2.1 The heat path for simulated annealing 

Our main result follow from the observation that the path-following interior point method has an 
analogue in the random walk world. Simulated annealing incorporates a carefully chosen tempera¬ 
ture schedule to reach its objective from a near-uniform distribution. We can think of all tempera¬ 
ture schedules as performing a random process whose changing mean is a single well-defined curve. 
For a given convex set K, C and objective 9, define the heat path as the following set of points, 
parametrized by the temperature t G [0, oo] as follows: 

HEATPATH(t) = E [x] 

^^Pe/t 

We can now dehne the heat path as HeatPath = Ut>o{HEATPATH(t)}. At this point it is not yet 
clear why this set of points is even a continuous curve in space, let alone equivalent to an analogous 
notion in the interior point world. We will return to this equivalence in section]^ 


3 An Overview of Interior Point Methods for Optimization 

Let us now review the vast literature on Interior Point Methods (IPMs) for optimization, and in 
particular the use of the Iterative Newton Step technique. The first instance of polynomial time 
algorithms for convex optimization using interior point machinery was the linear programming 
algorithm of Karmarkar [8]. The pioneering book of Nesterov and Nemirovskii m brought up 
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techniques in convex analysis that allowed for polynomial time algorithms for much more general 
convex optimization, ideas that are reviewed in great detail and clarity in m • 

The goal remains the same, to solve the linear optimization problem posed in Equation Q. The 
intuition behind IPMs is that iterative update schemes such as gradient descent for solving Q can 
fail because the boundary of 1C can be difficult to manage, and “moving in the direction of descent” 
will fail to achieve a fast rate of convergence. Thus one needs to “smooth out” the objective with 
the help of an additional function. In order to produce an efficient algorithm, a well-suited type of 
function is known as a self-concordant harrier which have received a great deal of attention in the 
optimization literature. 

A self-concordant barrier function (p : int(/C) —)• M, with barrier parameter, is a convex function 
satisfying two differential conditions: for any h G M"" and any x G /C, 

V^ip[h,h,h] < 2(V^(/?[/i,and 

Vp[h] < y/l^VMh,h], (7) 


in addition to the property that the barrier should approach infinity when approaching the boundary 
of 1C. Such function possess very desirable properties from the perspective of optimization, several 
of which we discuss in the present section. We note an important fact: while the existence of such 
a function p for general sets 1C has been given by Nesterov and Nemirovskii [16] —the universal 


barrier with parameter parameter u = 0(n), to be discussed further in Section 4.4—an efficient 
construction of such a function has remained elusive and was considered an important question in 
convex optimization. This indeed suggests that the annealing results we previously outlined are 
highly desirable, as HitAndRun requires only a membership oracle on 1C. However, one of the 
central results of the present work is the equivalence between annealing and IPMs, where we show 
that sampling gives one implicit access to a particular barrier function. This will be discussed at 
length in Section]^ 

Let us now assume we are given such a function p with barrier parameter v. A standard 
approach to solving 0 is to add the function p{x) to the primary objective, scaling the linear term 
by a “temperature” parameter t > 0: 


min {tO'x -\- p{x)}. 
x&K. 


( 8 ) 


As the the temperature t tends to oo the solution of Q will tend towards the optimal solution to 
This result is proved for completeness in Theorem]^ 

Towards developing in detail the iterative Newton algorithm, let us define the following for 
every positive integer k: 

tk := for some c > 0, (9) 

^k{x) ■= tkO'^X-\-(fix) 

Xk := argmin<kfc(x) 

X 

As y? is a barrier function, it is clear that x^ is in the interior of K and, in particular, V<I>fc(xfc) = 
0 =i> V(p{xk) = tkO. It is shown in [T5| (Equation 3.6) that any u-SCB (Self-Concordant Barrier) 
if satisfies Vp{x)~^ {y — x) < u, whence we can bound the difference in objective value between x^ 
and the optimal point x*: 

^^{XkVix* - Xk) 


9' {x* - Xk) = 


I'k 


V 

< —. 
tk 


( 10 ) 





We see that the approximation point Xk becomes exponentially better as k increases. Indeed, 
setting k = \^ log(z//e)] guarantees that the error is bounded by e. 

The iterative Newton’s method technique actually involves approximating with a near- 
optimal maximizer of at each iteration k. For an arbitrary v G M”", x G int(/C), and any A: > 1, 


following [15] 

we define: 



IklU 

:= v'^V‘^ip{x)v, 

the “local norm” of v w.r.t x; 

(11) 

ll^llx 

:= v'^V~'^ip{x)v, 

the corresponding dual norm of v, 

(12) 

A(x,4) 

■■= l|V<hfc(x)||:, 

the Newton decrement of x for temperature tk- 

(13) 


Note that, for a hxed point x £ K, the norms || • ||a; and || • ||* are dual to each othei0 It will turn 
out that X{x,tk) will be used both as a quantity in the algorithm, and as a kind of potential that 
we need to keep small. 

In Algorithm we describe the damped newton update algorithm, henceforth called Itera- 
tiveNewtonStep. We note that the 


Algorithm 3: IterativeNewtonStep 

1 Input: 6 G 1C and barrier function ip 

2 Solve: xq = argmaxa;gx: x + (p{x) 

3 for /c = 1, 2,... do 

4 [ Xfc ^ Xk-1 - i+A(xI_i,4) V~V(^fc-l)V$fc(xfc-l) 


The most difficult part of the analysis is in the following two lemmas, which are crucial elements 
of the IterativeNewtonStep analysis. A full exposition of these results is found in the excellent 
survey m- The first lemma tells us that when we update the temperature, we don’t perturb 
the Newton decrement too much. The second lemma establishes the quadratic convergence of the 
Newton Update for a fixed temperature. 

Lemma 1. Let c be the constant chosen in the definition (§. Let t > 0 be arbitrary and let 
t' = t . Then for any x G infilC), we have A(x, t') < (1 + c)A(x, t) + c. 

Lemma 2. Let k be arbitrary and assume we have some Xfc-i such that X{xk-i,tk) is finite. The 
Newton update Xk satisfies X{xk,tk) < 2A^(xfc_i,tfc). 

The previous two lemmas can be combined to show that the following invariant is maintained. 
Neither the constant bound of 1/3 on the Newton decrement nor the choice of c = 1/20 are 
particularly fundamental; they are convenient for the analysis but alternative choices are possible. 

Lemma 3. Assume we choose c = 1/20 for the parameter in (1^. Then for all k we have X{xk, tk) < 
i 

3 • 

^Technically, for || • |h and its dual to be a norm, we need to be positive definite and ip to be strictly convex. 
One can verify this is the case for bounded sets, which is the focus of this paper. 
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Proof. We give a simple proof by induction. The base case is satisfied since we assume that 
A(foTo) = 0, as to = 10 For the inductive step, assume A(xfc-i, tfc-i) < 1/3. Then 

A(xfc,tfc) ^ 2A 

< 2((1 + c)X{xk-utk-i) + c)2 < 2(0.4)2 < 1/3. 


The first inequality follows by Lemma and the second by Lemma hence we are done. □ 

Theorem 2. Let x* be a solution to the objective Q. For every k, Xk is an Ck-approximate solution 
to Q, where • In particular, for any e > 0, as long as k > ^ log(2z^/e) then Xk is an 

e-approximation solution. 


Proof. Let k be arbitrary. Then, 



0~^{Xk - X*) 


9~^{xk - X*) + 9~^{xk 

(By 

< 

^ -b 9~''{xk - Xk) 

(Holder’s Inequality) 

< 

// + Mijxk - Xkt 

{[TS] Eqn. 2.20) 

< 

U 1 11 P 11 * 

tk IIC'IISj, i-A(£fc,4) 

(Applying Lemma|^and (|l^ ) 

< 

p 1 * 1 

tk ' tk - 4 


< 


v+s/vlA: 


The last equation utilizes a fact that derives immediately from the definition Q, namely 

\\V^*{x)\\l = \\Vip*{x)\\e^,)<V^ 
holds for any SCBF cp with parameter n, and for any x € fC. 


(14) 

□ 


We proceed to give a specific barrier function that applies to any convex set and gives rise to 
an efficient algorithm. 


4 The Equivalence of IterativeNewton and SimulatedAnnealing 

We now show that the previous two techniques, Iterative Newton’s Method and Simulated Anneal¬ 
ing, are in a certain sense two sides of the same coin. In particular, with the appropriate choice of 
barrier function ip the task of computing the sequence of Newton iterates xi,X 2 ,... may be viewed 
precisely as estimating the means for each of the distributions , Pe ^, • • ■ • This correspondence 
helps to unify two very popular optimization methods, but also provides three additional novel 
results: 

1. We show that the heat path for simulated annealing is equivalent to the central path with the 
entropic barrier. 

^As stated, Algorithm 0 requires finding the minimizer of ip(-) on K., but this is not strictly necessary. The 
convergence rate can be established as long as a “reasonable” initial point xo can be computed—e.g. it suffices that 

Affo,!) < 1/2. 
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2. We show that the running time of Simulated Annealing can be improved to from the 

previous best of In the most general case we know that v = 0{n), but there are many 

convex sets in which the parameter v is significantly smaller. Notably such is the case for the 
positive-semi-definite cone, which is the basis of semi-definite programming and a cornerstone 


of many approximation algorithms. Further examples are surveyed in section 4.4 


3. We show that Iterative Newton’s Method, which previously required a provided barrier function 
on the set /C, can now be executed efficiently where the only access to /C is through a membership 
oracle. This method relies heavily on previously-developed sampling techniques [7]. Discussion 
is deferred to Appendix [C} 


4.1 The Duality of Optimization and Sampling 

We begin by rewriting our Boltzmann distribution for 8 in exponential family form, 

Pe^x) := exp{—6~^X — A{9)) where A{6) := log exp{—9~^x')dx'. (15) 

The function A(-) is known as the log partition function of the exponential family, and it has several 
very natural properties in terms of the mean and variance of Pg: 

VA{9) = - E [X] (16) 

^^A{9) = E [(A-EA:)(X-EA:)^]. (17) 

We can also appeal to Convex (Fenchel) Duality [I^ to obtain the conjugate 

A*{x) := svlPq9~^X — A{9). (18) 

It is easy to establish that A* is smooth and strictly convex. The domain of A*(-) is precisely the 
space of gradients of A(-), and it is straightforward to show that this is the set int(—/C), the interior 
of the reflection of JC about the origin. However we want a function defined on (the interior of) /C, 
not its reflection, so let us define a new function A*_{x) := A*{—x) whose domain is int(/C). We 
now present a recent result of Bubeck and Eldan [2j . 

Theorem 3 ([2])' The function Af is a u-self-concordant harrier funetion on 1C with v < n(l -|- 

0 ( 1 )). 

One of the significant drawbacks of barrier/Newton techniques is the need for a readily-available 
self-concordant barrier function. In their early work on interior point methods, Nesterov and 
Nemirovskii |16) provided such a function, often referred to as the “universal barrier”, yet the 
actual construction was given implicitly without oracle access to function values or derivatives. 
Bubeck and Eldan [2] refer to function AF(-) as the entropic barrier, a term we will also use, as it 
relates to a notion of differential entropy of the exponential family of distributions. 

It is important to note that, when our set of interest is a cone K, the entropic barrier on K 
corresponds exactly to the Fenchel dual of the universal barrier on the dual cone K* [6], which 
immediately establishes the self-concordance. Indeed, many beautiful properties of the entropic 
barrier on cones have been developed, and for a number of special cases AF(-) corresponds exactly 
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to the canonical barrier used in practice; e.g. ^l(-) on the PSD cone corresponds to the log- 
determinant barrier. In many such cases one obtains a much smaller barrier parameter v than the 


n(l -I- o(l)) bound. We defer a complete discussion to Section 4.4 

In order to utilize j 41(-) as a barrier function as in Q we must be able to approximately solve 
objectives of the form x + 741(x)}. One of the key observations of the paper, given in 

the following Proposition, is that solving this objective correponds to computing the mean of the 
distribution Pq. 


(19) 


Proposition 1. Let 9 G M” be arbitrary, and let Pg be defined as in (15). Then we have 

E [X] = aT:gmml6~^X + A*_{x)\ . 

x&nt(K.) ^ ' 

Proof. Let 9 be an arbitrary input vector. As we mentioned in (|^, standard Fechel duality theory 
gives us 

VA{9) = argmax = argmin |—+ A*(x)| . 


xSdom(A*) 


It can be verified that the domain of A* is precisely the interior of —1C, the reflection of JC. Thus 
we have 

XA{9) = argmin <—9^x + A*(x) | = — I argmin 1 9^x + A*_{x) | ] . 

^ ya:Sint(/C) ^ j 

In addition, we noted in (15) that VA{9) = — Ex~Pt)[-T]. Combining the latter two facts gives the 
result. □ 


We now have a direct connection between sampling methods and barrier optimization. For the 
remainder of this section, we shall assume that our chosen (p{-) is the entropic barrier A*(-), and 
the quantities || • ||a;,A(-,-) are defined accordingly. We shall also use the notation x{9) := 

Ex-pjX] = -XA{9). 

Lemma 4. Let 9,9' he such that ||x(0') — x(0)||a,(e) < Then we have 

DAt{x{9'),x{9)) = KL{P',Pg) = Da{ 9,9') < 2\\x{9') - x{9)\\l^^g^ (20) 

Proof. The duality relationship of the Bregman divergence, and its equivalence to Kullback-Leibler 
divergence, is classical and can be found in, e.g., [20] equation (5.10) The final inequality follow as 
a direct consequence of m, Equation 2.4. fH 


4.2 Equivalence of the heat path and central path 

The most appealing observation on the equivalence between random walk optimization and interior 
point methods is the following geometric equivalence of curves. For a given convex set 1C C and 
objective 9, define the heat path as the following set of points: 


Heat Path = U { He at Path (t)} = U { E [x]} 

t>0 t>0 a:~Pg 

J 

To see that this set of points is a continuous curve in space, consider the central path w.r.t. 
barrier function ip{x): 

CENTRALPATH(y?) = U {CENTRALPATH(t, V^)} = U {argmin {t9~^X + f{x)} 
t>0 t>0 x^K 
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Figure 2: For a set of seven different temperatures t, we used Hit-and-Run to generate and scatter 
plot several samples from using colored circles. We also computed the true means for each 
distribution, plotted with squares, giving a curve representing the HeatPath across the seven 
temperatures. Of course via Corollary this corresponds exactly to the CentralPath for the 
entropic barrier. 


It is well known that the central path is a continuous curve in space for any self-concordant 
barrier function (j). We now have the following immediate corollary of Proposition 

Corollary 2. The central path corresponding to the self-concordant barrier A* over any set K is 
equivalent to the heat path over the same set, i.e. 

HeatPath = CentralPath (H*) 

This mathematical equivalence is demonstrated in figure generated by simulation over a 
polytope. 

4.3 IPM techniques for sampling and the new schedule 

We now prove our main theorem, formally stated as: 

Theorem 4. The temperature schedule of 9i = R where R = diam(/C) and 9k ■= ^k-i, 

for u being the self-concordance parameter of the entropic barrier for the set JC, satisfies condition 
Q of theorem Therefore algorithm with this schedule returns an e-approximate solution in 
time 

Condition Q is formally proved in the following lemma, which crucially uses the interior point 
methodology, namely Lemma 
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Lemma 5. Consider distributions Pg and Pg/ where 9' = {1 + 7)0 for 7 < Then we have the 

following bound on the I 2 distance between distributions: 


max • 



Pe 


.P(l-|-7)0 

1 


P{l+-y)9 

2 

Pe 

2 J 


< 10 


Proof. We first show by elementary linear algebra that 

11^0/^(1-7)011 = ||P0/P(i+^)0|| =exp(ZlA((l+ 7 ) 0 , 0 )+ ^a(( 1 - 7 ) 0 , 0 )). 

Let us consider the log of the 2-norm: 

exp(—— A{6)) 


log\\Pg/P^i+^)g\\ = log [ 

Jk 


dPg 


fC exp(-(l -h 7 ) 0 "^- ^((1 + 7 ) 0 )) 

log [ exp( 70 '^x - A{9) -h ^((1 -h j)9))dPg 
Jk. 


= log / eyiY>{'^d'x — A{9)+A{{l + '~^)9))eyi-p{—9'x — A{d))dx 

Jk 

= 74 (( 1 -|- 7 ) 0 ) — 2j 4(0)-|-log f exp{—{l —'j)6~^x)dx 

Jk 

= 7l((l+7)0)-2^(0)+7l((l-7)0) 

= Z1a((1+7)0,0)+ ^a((1-7)0,0). 

Replacing 7 by — 7 , we get a completely symmetrical expression. Next, we observe that 

\\Pe{l+-y)/Pe\\ = \\Pg/Pg(i-^)\\ 

where 0 = 0(1 + 7 ) and 1 — 7 = = 1 — thus 7 G 7 x [1, 2], By this observation, both sides 

of the lemma follow if we prove an upper bound 


P 


(1+7)0 


< 2 


— ^ \ l-c 


II 

CN 

X 

.7 

V 

3^/n 

■)A(x(0),l) + c = c< J. 

Applying Lemma 

-x ((1 + 7)0)||2((i^^)^) 

Lemma [4] 

/ A(x(0), 1 - 1 - 7 ) 
(,l-A(+0),l-r7)/ 

(2.28) in dS] 

c . n(l/3)2 _ 1 

-cj ^ JPW ~ 2 

Lemma [T] 


Notice that to apply Lemmaj^ we need the point x((l + 7 ) 0 ) to he in the Dikin ellipsoid of x(0), 
which is exactly whats proved in the last two lines of the above proof. 

The bound on D^((l— 7 ) 0 , 0) follows in precisely the same fashion, by similar change of variables 
as before (again, the condition for applying Lemmaj^is proven in the last few lines of the equations 
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below): 


DA{ii--f)e,e) 


= DA{e,e{i + ^)) 

< 2 || x ( 0 ) - x ((1 + 7 ) 0 )|| 2 ^-^ 


< 2 
< 2 


C 

1-C 


A(x(6>)7+7) \ 

1-A(x(0),1+7); 
2 


^ oimi _ 1 

^ (2/3)2 “ 2 


Lemma |4] 
(2.27) in dS] 
Lemma [T] 


It follows that WPq/< QDA({i+^)d,e)+DA{(i 1 ) 6 , 8 ) ^ g 2 + 2 < 10. 


□ 


4.4 Some history on the entropic barrier and the universal barrier for cones 

Let 77 be a cone in and let K* = {6 : 6 ~^x > 0 Vx G 77} be its dual cone. We note that a 
cone 77 is homogeneous if its automorphism group is transitive; that is, for every x,y € K there 
is an automorphism B : K ^ K such that Bx = y. Homogeneous cones are very rare, but one 
notable example is the PD cone (matrices with all positive eigenvalues). Given any point x G 77, 
we can define a truncated region of 77* as the set 77*(x) := {y G 77* : x~^y < 1}. Nesterov and 
Nemirovskii m defined the first generic self-concordant barrier function, known as the universal 
barrier in terms of these regions. Namely, they show that the function 


uk{x) := log(vol(77*(x))) 

is a self concordant barrier function with an 0(n) parameter. 

There is an alternative characterization of the universal barrier in terms of the larg partition 
function. Let Fk{x) ■= logexp(0''^x)(70 and equivalently let Fk*{0) '■= log ex.p{9~^x)dx. It 
was shown by Giiler [5] that 

Fk{x) = uk{x) + n!, 

that is, the universal barrier corresponds exactly to a log-partition function but defined on the 
dual cone 77*, modulo a simple additive constant. We note that this is not the entropic barrier 
construction we have here, as our function of interest is A*_{- (•) (the Fenchel conjugate of 

Fk*{x)), and not Fk{x). However, it was also shown by Giiler [Jj that, when 77 is a homogeneous 
cone, we have the identity Fk{-) = 7^}},(-); in other words, the universal barrier and the entropic 
barrier are equivalent for homogeneous cones. 

It is worth noting that, following the connection of Giiler [5], ^L(-) is (up to additive constant) 
the Fenchel conjugate of the universal barrier uk* for 77*. It was shown by Nesterov and Nemirovskii 
|16| (Theorem 2.4.1) that Fenchel conjugation preserves all required self concordance properties 
and in particular if g is a i/-self-concordant barrier for a cone 77, then g* will be a self-concordant 
barrier for 77* with the same parameter u. With this observation it follows immediately that the 
entropic barrier F'kX) on 77 is an 0(n)-self-concordant barrier. Bubeck and Eldan [2] took this 
statement further, proving that F'^,{-) enjoys an essentially optimal self-concordance parameter 
V = n(l -|- 0 ( 1 )). 

It is important to note that the assumption that the set of interest is a cone is, roughly speaking, 
without loss of generality. Given any convex set U C M"" we have the fitted cone K{U) := {a(x, 1) : 
X G [/, a > 0} C Hence once can always work with the barrier function F^(^).(-) on 77(77), 
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and take its restriction to the set C/ x {1} C K[U) to obtain a barrier on U (affine restriction 
preserves the barrier properties). 

We conclude by summarizing several results in Giiler [5] regarding the entropic barrier for 
various cones, as well as the associated barrier parameter of each. In these canonical cases the 
entropic barrier corresponds exactly to the “typical” barrier, up to additive and multiplicative 
constants. We use the notation /(•) = g{-) to denote that / and g are identical up to additive 
constants. 

1. Assume K := M” the nonnegative orthant. This is a homogeneous cone and we have Fk{x) = 
F^t{x) = — XlILi logXj. This is the optimal barrier for K and the barrier parameter is = n. 

2. Assume A' := {x G : xf + . . . + x^_^ < x^} be the Lorentz cone. AT is a homogeneous self-dual 
cone. K can also be described by the fitted cone of the radius-1 L2 ball, so we may parameterize 
elements of K as a(x, 1) where a > 0 and x is vector in with L2 norm bounded by 1. Then 
FK{a{x, 1)) = F^t{a{x, 1)) = —log(a^ — x^). This barrier has parameter u = n + 1 which 
is indeed not optimal, one has the optimal barrer — log(a^ — x^) which has parameter u = 2, 
but this is simply a scaled version of the entropic barrier. 

3. The PSD cone K of positive semi-definite matrices, i.e. symmetric matrices with non-negative 

eigenvalues, is a homogeneous self-dual cone. The entropic barrier is Fk{x) = A^.(x) = 
— logdet(x) and exhibits a parameter of which is multiplicatively worse 

than the optimal barrier, the log-determined — logdet(x). However, as can be seen this barrier 
is quite simply a scaled version of the entropic barrier. 
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A An Explanation of Newton’s Method via Reweighting 

Proposition brings out a strong connection between interior point techniques and the ability to 
sample from Boltzmann distributions. But with this stochastic viewpoint, it is not immediately 
clear why Newton’s method is an appropriate iterative update scheme for optimization. We now 
provide some evidence along these lines. 
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Assuming we have already computed (an approximation of) x{0), and our distribution parameter 
is updated to a “nearby” 0 ', our goal is now to compute the new mean x{0'). 


—x{9') = f xdPer 

Jk. 



dPg'jx) 

dPg{x) 


dPe 


E 


X exp(X^(0 - 9') + A{9') - A{0)) 


Think of the last term as the reweighting factor. Now we are going to rewrite A{9) — A{9') = 
—VA{9){9' — 9) — Da{9',9) = x{9y' {9' — 9) — ¥Ai{Pg,P 0 i). We shall use the following approximation 
of the exponential: exp(z) ~ 1 + z for small values of z. Proceeding, 


-x{9') 


E 


Xexp(X^(0-0') - 


gKL(Pe,Pe,) jg 

Xr^Pg 


X exp((X 


gKL(Pe.Pe,) jg 

Xr^Pg 


X{l + {X 


xi9)'^{9' -9)+ KL{Pe,Pe,)) 
- x{9)y {9'- 9)) 
-x{9)y{P-9)) 


e^^^Pg’Po'\-x{9) + J:e{9' - 9)). 


Duality theory tells us that Eg = V‘^A{0) = V~‘^A*{x{0)) and 9' — 9 is precisely the gradient of the 
objective 9'~^x — A*{x) at the point x{9). The term is somewhat mysterious, but it can 

be interpreted as something of a “damping” factor akin to the Newton decrement damping of the 
the Newton update. 


B Proof structure of the Kalai-Vempala theorem 

We hereby sketch the structure of the proof of theorem for completeness. Recall the statement 
of the theorem: 

Algorithm with a temperature schedule that satisfies the following condition: 


The successive distributions are not “too far” in total variational distance. That is, for every j, 


max 


Pe. 


P 0 ,_i 


Pdi-i 


Pe. 


< 10 


Guarantees that HitAndRun requires N = 0{ri?) steps in order to ensure mixing to the 
stationary distribution Pg . 


Proof sketch. The proof is based on iteratively applying the following Theorem from |12) : 
Theorem 5. Let f be a density proportional to * over a convex set K such that 
[a], the level set of probability l/df contains a ball of radius s 
[bj. E/(||x — A^/lP) < S, where pi-f = E/[x] is the mean of f 

[c], the (.2 norm of the starting distribution a w.r.t. the stationary distribution of HitAndRun 
denoted vrj, is at most M. 
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Let IT™' he the distribution of the current point after m steps of HitAndRun applied to f. Then, 
for any r > 0, after m = log® steps, the total variation distance of and tt/ is less 

than T. 

The proof proceeds to show that conditions [a]-[c] of the theorem above are all satisfied if 
indeed condition Q is satisfied, along the steps below. Once it is established that the conditions of 
the theorem hold, then the next HitAndRun walk mixes and computes warm start and variance 
estimates for the next epoch. Then again, the conditions of the theorem hold, and this whole 
process is repeated for the entire temperature schedule. 

To show conditions [a]-[c], first notice that condition Q is essentially equivalent to condition 
[c] above. Thus we only need to worry about conditions [a],[b]. 

[I]. For simplicity, we assumed that at the current iteration, Tij = / is the identity. This can 
be assumed by a transformation of the space, and allows for simpler discussion of isotropy 
of densities (otherwise, the isotropy condition would be replaced by relative-isotropy w.r.t 
the current variance). 

[II]. A density / is C-isotropic if for any unit vector ||u|| = 1, 

^<[ p,f)ff{x)dx <C 

JR" 

It is shown (Lemma 4.2) that if the density given by / is 0(l)-isotropic, then conditions 
[a],[b] are satisfied with ^ = 0 {y/n). 

[III] . It is shown (Lemma 4.3) that if / is C-isotropic, and max 

CZ)-isotropic. 

[IV] . Since condition Q holds, together with the previous points [II,III] this implies that is 

isotropic for some constant. Thus, conditions [a]-[c] of Theorem hold. Therefore we can 
sample sufficiently many samples to estimate the covariance matrix Sj+i and proceed to 
the next epoch. 

Throughout the proof special care needs to be taken to ensure that repeated samples are nearly- 
independent for various concentration lemmas to apply, we omit discussion of these and the reader 
is referred to the original paper of [7|. 

□ 


< D, then g is 


C Interior point methods with a membership oracle 


Below we sketch a universal IPM algorithm - one that applies to any convex set described by a 
membership oracle - that can be implemented to run in polynomial time. This algorithm is an 
instantiation of Algorithm with the particular barrier function A*{x) as defined in section 

Without loss of generality, we can assume our goal is to (approximately) compute the update 
direction 


4.1 


V-‘^A*{x){e -VA*{x)) 
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for some x which is already within the Dikin ellipsoid of radius 1/2 around x{9). First, we note 
that the IPM analysis of |15) allows one to replace the inverse hessian A* {x) with the nearby 
V-M* {x{9)) = CovMtx(P 6 i)- Of course the latter can be estimated via sampling, in the sense that 
the estimate S will be “e-isotropically close”: 

(1 - e)v'^V^^{9')v < v'^tv < (1 + e)v'^V‘^^{9')v 

for any unit vector v. See, for example, [I] on the concentration of empirical covariance matrices. 
It remains to compute S/A*{x). Define 9{x) to be 

6 *(x) = argmax0 • X — log / exp{—9-y)dy = VA*{x) ( 21 ) 

® Jk 

Then 9(x) can be computed in polynomial time by another interior point algorithm - this problem, 
however, is much simpler to work with. Define T(0') := log exp{—9-y)dy to be the objective 
we want to optimize. Notice that V'^{9') = x — 'Kx'^Pg,[^'] and the latter can be estimated to 
within e via SimulatedAnnealing with samples. The hessian = —CovMtx(P 0 /) 

can similarly be estimated with an e-isotropically close empirical covariance. Because the error gap 
is multiplicatively close to 1 , the inverse operation on V^'I'( 0 ') maintains the approximation. 
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